Exploratory Data Analysis: Paris trees
Fedi GHANMI
In this notebook, I will:
- Present the data at hand, state my assumptions, and explore the data.
- Choose an angle of study and explain the observed phenomena to reach a conclusion.
- You can clone and run this notebook to see the visualizations that come with the explanations.
# Package import
import pandas as pd
from dataprep.eda import plot, plot_correlation
import re
import plotly.express as px
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
import pypandoc
# ignore warnings library
import warnings
warnings.filterwarnings("ignore")
# mini algorithms for the purpose of preprocessing
def int_regex(text):
    """Keep only int number from string"""
    return re.sub(r'\D', '', text)
def get_rounding(ville):
    """ Get the district number based on city name """
    if ville == "BOIS DE VINCENNES":
        return 12
    elif (ville == "HAUTS-DE-SEINE") or (ville == "VAL-DE-MARNE") or (ville == "SEINE-SAINT-DENIS"):
        return 3
    elif ville == "BOIS DE BOULOGNE":
        return 16
    else:
        return 'Missing'
Load the Data and Perform the Needed Preprocessing
- Load the data in CSV format.
- Delete non-informative columns.
- Split the geo_point_2d column into a latitude column and a longitude column.
- Preprocess the "ARRONDISSEMENT" column so that it contains only numbers.
# Data Loading
paris_trees_set1 = pd.read_csv("data_csv/paris_trees_set1.csv", low_memory= False)
paris_trees_set2 = pd.read_csv("data_csv/paris_trees_set2.csv", low_memory= False)
# DataFrame.append was removed in pandas 2.0; pd.concat does the same job
paris_trees = pd.concat([paris_trees_set1, paris_trees_set2], ignore_index=True)
claims = pd.read_csv("data_csv/dans-ma-rue.csv", low_memory=False, sep=";")
We have 11 columns and more than 200k observations, which corresponds to more than 200k trees planted across Paris.
- We will assume that this is the most up-to-date data on Paris trees, since the data source indicates 29 September 2022 as the last modification date.
- We will assume that the data posted by the source is accurate and reflects the distribution of trees in Paris.
paris_trees.head()
| | IDBASE | TYPE EMPLACEMENT | DOMANIALITE | ARRONDISSEMENT | COMPLEMENT ADRESSE | NUMERO | LIEU / ADRESSE | IDEMPLACEMENT | LIBELLE FRANCAIS | GENRE | ESPECE | VARIETE OUCULTIVAR | CIRCONFERENCE (cm) | HAUTEUR (m) | STADE DE DEVELOPPEMENT | REMARQUABLE | geo_point_2d |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2007514 | Arbre | Alignement | BOIS DE VINCENNES | NaN | NaN | ROUTE DU PESAGE | 000101017 | Platane | Platanus | x hispanica | NaN | 205 | 27 | Adulte | NON | 48.82496365567108,2.4461391819662435 |
| 1 | 2031959 | Arbre | Alignement | PARIS 13E ARRDT | NaN | NaN | RUE EUGENE FREYSSINET | 000104004 | Aulne | Alnus | incana | ''Aurea'' | 20 | 2 | Jeune (arbre) | NON | 48.83339630750579,2.3709812332986244 |
| 2 | 151442 | Arbre | CIMETIERE | HAUTS-DE-SEINE | NaN | NaN | CIMETIERE DE BAGNEUX / AVENUE DE L''AULNAIE / ... | A03200096004 | Aulne | Alnus | cordata | NaN | 0 | 0 | NaN | NaN | 48.8025849381091,2.3080238449792954 |
| 3 | 279207 | Arbre | Alignement | PARIS 13E ARRDT | NaN | NaN | RUE THOMIRE | 000202002 | Erable | Acer | platanoides | ''Crimson King'' | 75 | 15 | Adulte | NON | 48.81975318970025,2.348187507873684 |
| 4 | 291674 | Arbre | Alignement | PARIS 1ER ARRDT | 6 | NaN | RUE DU COLONEL DRIANT | 000202009 | Chêne | Quercus | robur | ''Fastigiata'' | 90 | 10 | Jeune (arbre)Adulte | NON | 48.86334886328951,2.340765942541643 |
# delete non informative columns
non_info_cols_paris = ["IDBASE", "TYPE EMPLACEMENT", "COMPLEMENT ADRESSE",
"NUMERO", "IDEMPLACEMENT", "REMARQUABLE"]
non_info_cols_claims = ["ID DECLARATION", "TYPE DECLARATION", "SOUS TYPE DECLARATION",
"VILLE", "DATE DECLARATION", "OUTIL SOURCE", "INTERVENANT",
"ID_DMR", "geo_shape", "mois_annee_decla"]
paris_trees.drop(non_info_cols_paris, inplace=True, axis = 1)
claims.drop(non_info_cols_claims, inplace = True, axis = 1)
# splitting column into two other columns
claims['latitude'] = claims['geo_point_2d'].apply(
    lambda x: float(x[0:x.find(",")]) if not pd.isnull(x) else x)
# slice to the end of the string so the last digit of the longitude is kept
claims['longitude'] = claims['geo_point_2d'].apply(
    lambda x: float(x[x.find(",")+1:]) if not pd.isnull(x) else x)
# splitting column into two other columns
paris_trees['latitude'] = paris_trees['geo_point_2d'].apply(
    lambda x: float(x[0:x.find(",")]) if not pd.isnull(x) else x)
paris_trees['longitude'] = paris_trees['geo_point_2d'].apply(
    lambda x: float(x[x.find(",")+1:]) if not pd.isnull(x) else x)
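The same split can be written with a single `str.split` on the comma. The following is a minimal sketch, assuming every non-missing value has the form "latitude,longitude" as in the preview table; `split_geo_point` is a hypothetical helper that is not part of the original notebook.

```python
import math

def split_geo_point(value):
    """Return (latitude, longitude) as floats, or (nan, nan) for missing input."""
    # Missing cells typically arrive as None or float NaN after pd.read_csv.
    if value is None or (isinstance(value, float) and math.isnan(value)):
        return (math.nan, math.nan)
    lat_text, lon_text = value.split(",", 1)
    return (float(lat_text), float(lon_text))

# Value copied from the first row of the preview table.
lat, lon = split_geo_point("48.82496365567108,2.4461391819662435")
print(lat, lon)
```

Applied with `paris_trees['geo_point_2d'].apply(split_geo_point)`, this would yield one tuple per row, which can then be unpacked into the two columns.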
# Keeping "ARRONDISSEMENT" Column clean.
# (loop variable named ch so that the builtin chr is not shadowed)
paris_trees['ARRONDISSEMENT'] = paris_trees['ARRONDISSEMENT'].apply(
    lambda x: int_regex(x) if any(ch.isdigit() for ch in x) else x)
paris_trees['ARRONDISSEMENT'] = paris_trees['ARRONDISSEMENT'].apply(
    lambda x: get_rounding(x) if not any(ch.isdigit() for ch in x) else x)
paris_trees["ARRONDISSEMENT"] = paris_trees["ARRONDISSEMENT"].astype(int)
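The two passes above can be folded into one function. This is only a sketch: `to_district` and `NAMED_AREAS` are hypothetical names, the mapping is copied from `get_rounding`, and returning `None` for unknown labels (instead of the string 'Missing') is an assumption made here so that a later integer cast cannot fail on text.

```python
import re

# Mapping copied from get_rounding above. Labels without digits that are not
# listed here are treated as unknown (an assumption of this sketch).
NAMED_AREAS = {
    "BOIS DE VINCENNES": 12,
    "BOIS DE BOULOGNE": 16,
    "HAUTS-DE-SEINE": 3,
    "VAL-DE-MARNE": 3,
    "SEINE-SAINT-DENIS": 3,
}

def to_district(label):
    """Return the district number for a raw ARRONDISSEMENT label, or None."""
    digits = re.sub(r"\D", "", label)
    if digits:
        return int(digits)
    return NAMED_AREAS.get(label)

for raw in ["PARIS 13E ARRDT", "PARIS 1ER ARRDT", "BOIS DE VINCENNES", "ELSEWHERE"]:
    print(raw, "->", to_district(raw))
```

With pandas, a column produced this way could be stored with the nullable `Int64` dtype so that unknown districts stay missing rather than breaking the conversion.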
Start Analysis
- Describe some statistical features of the data variables.
- Search for meaning and insight.
# Plot Box plot
plot(paris_trees, "latitude" , display=["Box Plot"])
# Plot Box plot
plot(paris_trees, "longitude", display=["Box Plot"])
Description: These two box plots show the latitude and longitude ranges of our data. Latitude goes from 48.76 to 48.91 and longitude from 2.21 to 2.47. These ranges correspond to the limits of the Paris districts (arrondissements), from the 1st to the 20th. The 25% and 75% quantiles lie at 48.83 and 48.87 for latitude, and at 2.3 and 2.83 for longitude. Since each observation in our data is a tree, about half of the trees fall between the quartiles of each coordinate. But what do these coordinates correspond to exactly?
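To make the reading of a box plot concrete, the sketch below computes the quartiles of a handful of values and the share of observations lying between them. The latitudes are hypothetical and chosen only for illustration; they are not taken from the dataset.

```python
import statistics

# Hypothetical latitudes, for illustration only.
latitudes = [48.76, 48.80, 48.83, 48.84, 48.85, 48.86, 48.87, 48.89, 48.91]

# Q1, median and Q3: the three cut points drawn by a box plot.
q1, median, q3 = statistics.quantiles(latitudes, n=4, method="inclusive")

# Share of observations inside the box (between Q1 and Q3).
inside = [v for v in latitudes if q1 <= v <= q3]
share = len(inside) / len(latitudes)
print(q1, median, q3, round(share, 2))
```

The box covers roughly half of the values of one variable at a time. It says nothing about how many trees fall inside both the latitude box and the longitude box simultaneously.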
# Plot Viz
plot(paris_trees, "DOMANIALITE", display=["Bar Chart", "Value Table"])
| Value | Count | Frequency (%) |
|---|---|---|
| Alignement | 106206 | 51.6% |
| Jardin | 48933 | 23.8% |
| CIMETIERE | 31894 | 15.5% |
| DASCO | 7254 | 3.5% |
| PERIPHERIQUE | 5259 | 2.6% |
| DJS | 4726 | 2.3% |
| DFPE | 1364 | 0.7% |
| DAC | 61 | < 0.1% |
| DASES | 39 | < 0.1% |
Description: More than 200 thousand trees are recorded in Paris. The bar chart shows that "Alignement" trees (trees lining the streets) are by far the most abundant, with more than 100 thousand trees, or more than 50% of all recorded trees. Next come "Jardin" (gardens) and "CIMETIERE" (cemeteries), which together account for around 40% of the trees.
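The percentages quoted above can be checked directly from the counts in the value table. A short sketch, assuming the nine categories listed there cover all rows:

```python
# Counts copied from the DOMANIALITE value table above.
counts = {
    "Alignement": 106206, "Jardin": 48933, "CIMETIERE": 31894,
    "DASCO": 7254, "PERIPHERIQUE": 5259, "DJS": 4726,
    "DFPE": 1364, "DAC": 61, "DASES": 39,
}

total = sum(counts.values())
shares = {name: 100 * n / total for name, n in counts.items()}

for name, pct in shares.items():
    print(f"{name:<13}{pct:5.1f}%")
print("Jardin + CIMETIERE:", round(shares["Jardin"] + shares["CIMETIERE"], 1))
```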
# subsetting of paris trees into alignement and jardin
alignement = paris_trees.loc[
    paris_trees["DOMANIALITE"] == "Alignement", :]
jardin = paris_trees.loc[
    paris_trees["DOMANIALITE"] == "Jardin", :]
# Plot visualization
plot(alignement, "GENRE", display=["Pie Chart", "Value Table"])
| Value | Count | Frequency (%) |
|---|---|---|
| Platanus | 35686 | 33.6% |
| Aesculus | 16185 | 15.2% |
| Tilia | 12040 | 11.3% |
| Sophora | 8633 | 8.1% |
| Acer | 5808 | 5.5% |
| Celtis | 3040 | 2.9% |
| Corylus | 2363 | 2.2% |
| Pyrus | 2329 | 2.2% |
| Fraxinus | 2278 | 2.1% |
| Prunus | 1861 | 1.8% |
| Other values (73) | 15983 | 15.0% |
Description: Alignement trees are mainly of the genus Platanus, which accounts for more than 30% of alignement trees across Paris. Aesculus and Tilia are less frequent but still well represented, at about 15% and 11% respectively. Below are some images of these genera, which are a familiar sight.
Platanus:
Aesculus
# Plot Pie chart
plot(jardin, "GENRE", display=["Pie Chart", "Value Table"])
| Value | Count | Frequency (%) |
|---|---|---|
| Tilia | 4974 | 10.2% |
| Acer | 4724 | 9.7% |
| Pinus | 3857 | 7.9% |
| Prunus | 3400 | 6.9% |
| Aesculus | 2878 | 5.9% |
| Quercus | 2403 | 4.9% |
| Platanus | 2045 | 4.2% |
| Sophora | 1537 | 3.1% |
| Betula | 1509 | 3.1% |
| Fagus | 1492 | 3.0% |
| Other values (156) | 20114 | 41.1% |
Description: Looking at the distribution of trees in the gardens ("Jardin"), we see that garden trees are very diverse and that no single genus dominates. In the pie chart, the upper half is made of slices of comparable size, and the lower half is one large slice that groups all the minority genera.
Another point of view: all these trees and genera differ in age, so how are they distributed according to their age?
# Plot Statistics
plot(paris_trees, "HAUTEUR (m)", display=["Stats"])
Overview
| Approximate Distinct Count | 42.0135 |
|---|---|
| Approximate Unique (%) | 0.0% |
| Missing | 0 |
| Missing (%) | 0.0% |
| Infinite | 0 |
| Infinite (%) | 0.0% |
| Memory Size | 3291776 |
| Mean | 8.7805 |
| Minimum | 0 |
| Maximum | 86 |
| Zeros | 25926 |
| Zeros (%) | 12.6% |
| Negatives | 0 |
| Negatives (%) | 0.0% |
Quantile Statistics
| Minimum | 0 |
|---|---|
| 5-th Percentile | 0 |
| Q1 | 5 |
| Median | 8 |
| Q3 | 12 |
| 95-th Percentile | 20 |
| Maximum | 86 |
| Range | 86 |
| IQR | 7 |
Descriptive Statistics
| Mean | 8.7805 |
|---|---|
| Standard Deviation | 5.9339 |
| Variance | 35.2114 |
| Sum | 1.8065e+06 |
| Skewness | 0.5655 |
| Kurtosis | 0.2438 |
| Coefficient of Variation | 0.6758 |
Description: The statistics show that more than 12% of the trees have a recorded height of zero. This may mean that they were recently planted (less than 1 meter tall), although a zero could also stand for a height that was never measured. The median is 8, meaning that at least half of the trees are 8 meters tall or taller. The mean of about 8.8 is close to the median, which indicates a more or less normal distribution with a small positive skewness (0.57): most trees have a height close to 8.8 meters.
# about half of our trees are mature (circumference of 70 cm or more, the median)
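How the zeros are interpreted changes the summary statistics. The sketch below uses hypothetical heights and assumes, purely for illustration, that a zero stands for a tree that was not measured; the source does not state what a zero means.

```python
import statistics

# Hypothetical heights in metres; 0 is treated as "not measured" in this sketch.
heights = [0, 0, 5, 6, 8, 8, 10, 12, 15, 20]

# Keep only trees with a recorded, non-zero height.
measured = [h for h in heights if h > 0]

mean_all, median_all = statistics.mean(heights), statistics.median(heights)
mean_measured, median_measured = statistics.mean(measured), statistics.median(measured)

print("all rows :", mean_all, median_all)
print("measured :", mean_measured, median_measured)
```

Dropping the zeros raises both the mean and the median, so the choice is worth stating explicitly before comparing heights across districts or genera.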
plot(paris_trees, "CIRCONFERENCE (cm)", display=["Stats"])
Overview
| Approximate Distinct Count | 460.6149 |
|---|---|
| Approximate Unique (%) | 0.2% |
| Missing | 0 |
| Missing (%) | 0.0% |
| Infinite | 0 |
| Infinite (%) | 0.0% |
| Memory Size | 3291776 |
| Mean | 81.0143 |
| Minimum | 0 |
| Maximum | 2246 |
| Zeros | 20121 |
| Zeros (%) | 9.8% |
| Negatives | 0 |
| Negatives (%) | 0.0% |
Quantile Statistics
| Minimum | 0 |
|---|---|
| 5-th Percentile | 0 |
| Q1 | 30 |
| Median | 70 |
| Q3 | 115 |
| 95-th Percentile | 200 |
| Maximum | 2246 |
| Range | 2246 |
| IQR | 85 |
Descriptive Statistics
| Mean | 81.0143 |
|---|---|
| Standard Deviation | 62.9701 |
| Variance | 3965.2312 |
| Sum | 1.6668e+07 |
| Skewness | 1.3598 |
| Kurtosis | 9.7232 |
| Coefficient of Variation | 0.7773 |
Description: For the circumference of these trees, we observe a mean of 81 cm and a standard deviation of about 63 cm. This large deviation is partly due to outliers of unknown origin: the maximum value is 2246 cm, which is obviously an irregularity. At the 95th percentile, however, trees have a circumference of 200 cm.
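One common rule of thumb for flagging such irregular values is Tukey's fences, which mark anything beyond 1.5 times the interquartile range from the quartiles. The sketch below applies it to the quartiles reported in the table; it is one possible convention, not the method used in this notebook.

```python
# Quartiles copied from the CIRCONFERENCE (cm) statistics above.
q1, q3 = 30, 115
iqr = q3 - q1

# Tukey's fences: values outside [lower_fence, upper_fence] are flagged.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

maximum = 2246  # largest circumference reported in the table
print(iqr, lower_fence, upper_fence, maximum > upper_fence)
```

Since circumferences cannot be negative, only the upper fence matters here. The reported maximum lies far above it, while the 95th percentile (200 cm) stays below it.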
# Plot Scatter Plot
fig = px.scatter(paris_trees.sample(5000, random_state=1), x="CIRCONFERENCE (cm)", y='HAUTEUR (m)')
fig.show()
Description: The scatter plot of these two variables indicates a positive correlation between them, which is expected: the larger the circumference of a tree, the taller we expect it to be.
# Plot Heatmap
plot_correlation(paris_trees, display=["Pearson"])
Description of height and circumference: According to this heatmap, circumference and height are correlated with a Pearson coefficient of 0.8, which indicates a strong positive linear relationship. It does not mean that the relationship holds 80% of the time: the squared coefficient, 0.64, says that about 64% of the variance of one variable is accounted for by a linear relationship with the other. The remaining variation can be explained in part by differences between genera: some may reach their maximum circumference while remaining shorter than others.
Description of other variables: The remaining variables are only very weakly correlated, suggesting that there is no direct linear relationship between latitude, longitude, and district.
Another point of view: Having studied the types of trees, their genera, and the correlations between variables, let us now examine their location across the Paris districts.
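For reference, the Pearson coefficient shown in the heatmap can be computed by hand as the covariance divided by the product of the standard deviations. The sketch below does so on hypothetical measurements (not taken from the dataset) and also prints the squared coefficient, which is the share of variance captured by a straight-line fit.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sums of cross-products and squared deviations (the 1/n factors cancel).
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical circumferences (cm) and heights (m), for illustration only.
circumference = [20, 40, 60, 80, 100, 120]
height = [3, 5, 9, 8, 14, 13]

r = pearson(circumference, height)
print(round(r, 3), round(r ** 2, 3))
```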
# Subsetting and Map Display
arrondissement_paris = pd.DataFrame(pd.pivot_table(paris_trees, index=["ARRONDISSEMENT"]
,aggfunc="size"), columns=["Occurences"])
arrondissement_rec = pd.DataFrame(pd.pivot_table(claims, index=["ARRONDISSEMENT"] ,aggfunc="size"),
columns=["Occurences"])
Description: The map above summarizes the distribution of trees across Paris. The greener the district, the more trees it has, ranging from 500 trees to more than 25k in the greenest district. We notice that the outer districts, from the 12th to the 20th (the last one), are richer in trees than the inner ones.
Since 2020, an application called "Dans ma Rue" has let residents report to the municipality any anomaly they notice in the streets. Among all these reports, we will study the distribution of complaints regarding trees and plants.
# Plot Value Table
plot(claims, "ANNEE DECLARATION", display=["Value Table"])
| Value | Count | Frequency (%) |
|---|---|---|
| 2022 | 3837 | 50.7% |
| 2021 | 3736 | 49.3% |
Description: According to this value table, our historical data is up to date and covers the complaints of the past two years, with about half of them filed in 2021 and the other half in 2022.
Description: The map of complaints shows a color gradient similar to that of the tree distribution: the redder the area, the more complaints it contains. Complaints range from 80 to more than 800 per district. The 15th district has the most complaints in all of Paris. But why is that?
# Subsetting
paris_lat_long = paris_trees.drop(["LIEU / ADRESSE", "LIBELLE FRANCAIS", "geo_point_2d"], axis = 1)
paris_ext= paris_lat_long.loc[(paris_lat_long["ARRONDISSEMENT"] >= 12), :]
paris_int= paris_lat_long.loc[(paris_lat_long["ARRONDISSEMENT"] < 12), :]
# cemeteries = 40% > 35% of the outer zone.
plot(paris_int, "DOMANIALITE", display=["Pie Chart"])
Description: The pie chart above summarizes the types of green space in the inner districts, from the 1st to the 11th. Around 40% of the trees are in cemeteries and another 40% are alignement trees, while gardens account for a minority of about 13%. The small number of complaints in these districts may suggest that cemeteries and alignement trees are in good shape.
# Plot Pie Chart
plot(paris_ext, "DOMANIALITE", display=["Pie Chart"])
Description: The pie chart above summarizes the types of green space in the outer part of Paris, districts 12 to 20. We see that 55% are alignement trees and, surprisingly, about 30% are in gardens. The large number of complaints there might thus be explained by gardens that are not clean, not well maintained by the municipality, overgrown with vegetation, and so on.
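The inner/outer comparison behind the two pie charts boils down to splitting the rows at district 12 and computing category shares on each side. A minimal sketch with hypothetical rows (not taken from the dataset); the `shares` helper is introduced here for illustration only.

```python
from collections import Counter

# Hypothetical (district, domaniality) pairs, for illustration only.
trees = [
    (3, "Alignement"), (7, "CIMETIERE"), (11, "Jardin"), (11, "Alignement"),
    (12, "Alignement"), (15, "Jardin"), (16, "Alignement"), (20, "Jardin"),
]

def shares(rows):
    """Percentage of rows per domaniality."""
    counts = Counter(kind for _, kind in rows)
    total = sum(counts.values())
    return {kind: 100 * n / total for kind, n in counts.items()}

inner = [row for row in trees if row[0] < 12]    # districts 1 to 11
outer = [row for row in trees if row[0] >= 12]   # districts 12 to 20
print(shares(inner))
print(shares(outer))
```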
# Subtype declaration word cloud display
claims_dec = pd.read_csv("data_csv/dans-ma-rue_v2.csv", low_memory= False, sep=";")
claims_dec_ext = claims_dec.loc[claims_dec["ARRONDISSEMENT"] >= 12,:]
plot(claims_dec_ext, "SOUS TYPE DECLARATION", display=["Word Cloud"])
Description: The word cloud above summarizes the complaints posted by users of the application. We see the words "arbre", "herbes", "animaux", "insecteprésence", "jardiniere", "animal", "rat", ... This vocabulary is consistent with what we previously concluded: gardens are not well maintained. So we ask the question: what is the future of Paris forestry?
Conclusion: The Dans ma Rue application has helped gather data about the areas that most need care. All that remains now is to act on this data and save Paris forestry!